Robotics 56
★ HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, Chengkai Hou, Mengdi Zhao, KC alex Zhou, Pheng-Ann Heng, Shanghang Zhang
Recent advancements in vision-language models (VLMs) for common-sense
reasoning have led to the development of vision-language-action (VLA) models,
enabling robots to perform generalized manipulation. Although existing
autoregressive VLA methods leverage large-scale pretrained knowledge, they
disrupt the continuity of actions. Meanwhile, some VLA methods incorporate an
additional diffusion head to predict continuous actions, relying solely on
VLM-extracted features, which limits their reasoning capabilities. In this
paper, we introduce HybridVLA, a unified framework that seamlessly integrates
the strengths of both autoregressive and diffusion policies within a single
large language model, rather than simply connecting them. To bridge the
generation gap, a collaborative training recipe is proposed that injects
diffusion modeling directly into next-token prediction. With this recipe,
we find that these two forms of action prediction not only reinforce each other
but also exhibit varying performance across different tasks. Therefore, we
design a collaborative action ensemble mechanism that adaptively fuses these
two predictions, leading to more robust control. In experiments, HybridVLA
outperforms previous state-of-the-art VLA methods across various simulation and
real-world tasks, including both single-arm and dual-arm robots, while
demonstrating stable manipulation in previously unseen configurations.
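The collaborative ensemble idea lends itself to a minimal sketch. The confidence-weighted fusion below is a hypothetical illustration, not the paper's exact mechanism; the confidence sources named in the comments are assumptions.

```python
import numpy as np

def fuse_actions(a_diff, a_ar, conf_diff, conf_ar):
    """Adaptively fuse diffusion and autoregressive action predictions.

    a_diff, a_ar: (action_dim,) continuous action vectors from the two heads.
    conf_diff, conf_ar: scalar confidences, e.g. inverse predictive variance
    for the diffusion head and mean token probability for the AR head
    (hypothetical choices; the paper's ensemble may weight differently).
    """
    w = np.array([conf_diff, conf_ar], dtype=float)
    w /= w.sum()  # convex combination of the two predictions
    return w[0] * np.asarray(a_diff) + w[1] * np.asarray(a_ar)

# Example: a 7-D action (6-DoF end-effector delta + gripper command).
print(fuse_actions(np.zeros(7), np.ones(7), conf_diff=0.8, conf_ar=0.4))
```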
☆ UniGoal: Towards Universal Zero-shot Goal-oriented Navigation CVPR 2025
In this paper, we propose a general framework for universal zero-shot
goal-oriented navigation. Existing zero-shot methods build their inference
frameworks upon large language models (LLMs) for specific tasks; these
frameworks differ considerably in their overall pipelines and fail to
generalize across different types of goals.
Towards the aim of universal zero-shot navigation, we propose a uniform graph
representation to unify different goals, including object category, instance
image, and text description. We also convert the agent's observations into an
online-maintained scene graph. With this consistent scene and goal
representation, we preserve most structural information compared with pure text
and are able to leverage LLM for explicit graph-based reasoning. Specifically,
we conduct graph matching between the scene graph and goal graph at each time
instant and propose different strategies to generate the long-term exploration
goal according to the matching state. When no match is found, the agent
iteratively searches for subgraphs of the goal. Under partial matching, the
agent uses coordinate projection and anchor-pair alignment to infer the goal
location. Finally, under perfect matching, scene graph correction and goal
verification are applied. We also present a blacklist mechanism to enable
robust switching between stages. Extensive experiments on several benchmarks show that our
UniGoal achieves state-of-the-art zero-shot performance on three studied
navigation tasks with a single model, even outperforming task-specific
zero-shot methods and supervised universal methods.
comment: Accepted to CVPR 2025
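The staged, matching-state-driven logic in the abstract can be summarized in a short dispatch sketch; the toy matcher below compares node sets only, and all names are illustrative placeholders rather than UniGoal's implementation.

```python
from enum import Enum

class MatchState(Enum):
    ZERO = 0      # no overlap between scene graph and goal graph
    PARTIAL = 1   # some, but not all, goal nodes matched
    PERFECT = 2   # goal graph fully matched in the scene graph

def matching_state(scene_nodes: set, goal_nodes: set) -> MatchState:
    """Toy matcher comparing node sets only; real graph matching also
    compares edges and attributes."""
    overlap = scene_nodes & goal_nodes
    if not overlap:
        return MatchState.ZERO
    return MatchState.PERFECT if overlap == goal_nodes else MatchState.PARTIAL

def select_strategy(state: MatchState) -> str:
    """Dispatch the long-term goal strategy on the matching state."""
    return {
        MatchState.ZERO: "iteratively search for subgraphs of the goal",
        MatchState.PARTIAL: "coordinate projection + anchor-pair alignment",
        MatchState.PERFECT: "scene graph correction + goal verification",
    }[state]

print(select_strategy(matching_state({"sofa", "table"}, {"table", "lamp"})))
# -> coordinate projection + anchor-pair alignment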
☆ NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models
Acquiring physically plausible motor skills across diverse and unconventional
morphologies, including humanoid robots, quadrupeds, and animals, is essential
for advancing character simulation and robotics. Traditional methods, such as
reinforcement learning (RL), are task- and body-specific, require extensive
reward function engineering, and do not generalize well. Imitation learning
offers an alternative but relies heavily on high-quality expert demonstrations,
which are difficult to obtain for non-human morphologies. Video diffusion
models, on the other hand, are capable of generating realistic videos of
various morphologies, from humans to ants. Leveraging this capability, we
propose a data-independent approach for skill acquisition that learns 3D motor
skills from 2D-generated videos, with generalization capability to
unconventional and non-human forms. Specifically, we guide the imitation
learning process with vision transformers, performing video-based comparisons
by calculating pair-wise distances between video embeddings. Along with the
video-encoding distance, we also use a computed similarity between segmented
video frames as a guidance reward. We validate our method on locomotion tasks
involving unique body configurations. In humanoid robot locomotion tasks, we
demonstrate that 'No-data Imitation Learning' (NIL) outperforms baselines
trained on 3D motion-capture data. Our results highlight the potential of
leveraging generative video models for physically plausible skill learning with
diverse morphologies, effectively replacing data collection with data
generation for imitation learning.
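A minimal sketch of the embedding-distance guidance reward, assuming per-frame vision-transformer features are already extracted and frames are time-aligned (the paper's pair-wise comparison may differ in detail):

```python
import numpy as np

def embedding_reward(rollout_embs, ref_embs):
    """Imitation reward from distances between video embeddings.

    rollout_embs, ref_embs: (T, D) arrays of per-frame features, e.g. from a
    vision transformer applied to the agent's rollout and to the generated
    reference video. Returns mean frame-aligned cosine similarity, so higher
    values mean the rollout looks more like the reference video.
    """
    a = rollout_embs / np.linalg.norm(rollout_embs, axis=1, keepdims=True)
    b = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    cos_sim = np.sum(a * b, axis=1)   # per-frame cosine similarity
    return float(np.mean(cos_sim))

rng = np.random.default_rng(0)
print(embedding_reward(rng.normal(size=(16, 384)), rng.normal(size=(16, 384))))
```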
☆ DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding
Ayesha Ishaq, Jean Lahoud, Ketan More, Omkar Thawakar, Ritesh Thawkar, Dinura Dissanayake, Noor Ahsan, Yuhao Li, Fahad Shahbaz Khan, Hisham Cholakkal, Ivan Laptev, Rao Muhammad Anwer, Salman Khan
While large multimodal models (LMMs) have demonstrated strong performance
across various Visual Question Answering (VQA) tasks, certain challenges
require complex multi-step reasoning to reach accurate answers. One
particularly challenging task is autonomous driving, which demands thorough
cognitive processing before decisions can be made. In this domain, a sequential
and interpretive understanding of visual cues is essential for effective
perception, prediction, and planning. Nevertheless, common VQA benchmarks often
focus on the accuracy of the final answer while overlooking the reasoning
process that enables the generation of accurate responses. Moreover, existing
methods lack a comprehensive framework for evaluating step-by-step reasoning in
realistic driving scenarios. To address this gap, we propose DriveLMM-o1, a new
dataset and benchmark specifically designed to advance step-wise visual
reasoning for autonomous driving. Our benchmark features over 18k VQA examples
in the training set and more than 4k in the test set, covering diverse
questions on perception, prediction, and planning, each enriched with
step-by-step reasoning to ensure logical inference in autonomous driving
scenarios. We further introduce a large multimodal model that is fine-tuned on
our reasoning dataset, demonstrating robust performance in complex driving
scenarios. In addition, we benchmark various open-source and closed-source
methods on our proposed dataset, systematically comparing their reasoning
capabilities for autonomous driving tasks. Our model achieves a +7.49% gain in
final answer accuracy, along with a 3.62% improvement in reasoning score over
the previous best open-source model. Our framework, dataset, and model are
available at https://github.com/ayesha-ishaq/DriveLMM-o1.
comment: 8 pages, 4 figures, 3 tables, github:
https://github.com/ayesha-ishaq/DriveLMM-o1
☆ Towards Safe Path Tracking Using the Simplex Architecture
Robot navigation in complex environments necessitates controllers that are
adaptive and safe. Traditional controllers like Regulated Pure Pursuit, Dynamic
Window Approach, and Model-Predictive Path Integral, while reliable, struggle
to adapt to dynamic conditions. Reinforcement Learning offers adaptability but
lacks formal safety guarantees. To address this, we propose a path tracking
controller leveraging the Simplex architecture. It combines a Reinforcement
Learning controller for adaptiveness and performance with a high-assurance
controller providing safety and stability. Our contribution is twofold. First,
we discuss general stability and safety considerations for designing
controllers using the Simplex architecture. Second, we present a
Simplex-based path tracking controller. Our simulation results, supported by
preliminary in-field tests, demonstrate the controller's effectiveness in
maintaining safety while achieving comparable performance to state-of-the-art
methods.
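The Simplex pattern itself is simple to state in code: run the high-performance controller when a monitor certifies safety, otherwise fall back to the high-assurance one. The sketch below is generic, with a toy 1-D safety monitor, and is not the paper's controller.

```python
def simplex_step(state, rl_policy, safety_policy, is_safe):
    """One control step of a Simplex-style arbiter (illustrative sketch):
    use the high-performance RL action when it is verifiably safe,
    otherwise fall back to the high-assurance controller."""
    u_rl = rl_policy(state)
    if is_safe(state, u_rl):
        return u_rl, "RL"
    return safety_policy(state), "high-assurance"

# Toy example: 1-D tracking with an actuation limit as the safety condition.
rl = lambda x: 2.0 * (1.0 - x)               # aggressive proportional law
safe = lambda x: 0.5 * (1.0 - x)             # conservative fallback
is_safe = lambda x, u: abs(u) <= 1.0         # stand-in safety monitor
print(simplex_step(0.0, rl, safe, is_safe))  # -> (0.5, 'high-assurance')
```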
☆ NuExo: A Wearable Exoskeleton Covering all Upper Limb ROM for Outdoor Data Collection and Teleoperation of Humanoid Robots
The evolution from motion capture and teleoperation to robot skill learning
has emerged as a hotspot and critical pathway for advancing embodied
intelligence. However, existing systems still face a persistent gap in
simultaneously achieving four objectives: accurate tracking of full upper limb
movements over extended durations (Accuracy), ergonomic adaptation to human
biomechanics (Comfort), versatile data collection (e.g., force data) and
compatibility with humanoid robots (Versatility), and lightweight design for
outdoor daily use (Convenience). We present a wearable exoskeleton system,
incorporating user-friendly immersive teleoperation and multi-modal sensing
collection to bridge this gap. Thanks to a novel shoulder mechanism with a
synchronized linkage and timing-belt transmission, the system adapts well to
compound shoulder movements and covers 100% of the natural upper limb range of
motion. Weighing 5.2 kg, NuExo supports backpack-type
use and can be conveniently applied in daily outdoor scenarios. Furthermore, we
develop a unified intuitive teleoperation framework and a comprehensive data
collection system integrating multi-modal sensing for various humanoid robots.
Experiments across distinct humanoid platforms and different users validate our
exoskeleton's superiority in motion range and flexibility, while confirming its
stability in data collection and teleoperation accuracy in dynamic scenarios.
comment: 8 pages
☆ KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation
With the rapid advancement of large language models (LLMs) and
vision-language models (VLMs), significant progress has been made in developing
open-vocabulary robotic manipulation systems. However, many existing approaches
overlook the importance of object dynamics, limiting their applicability to
more complex, dynamic tasks. In this work, we introduce KUDA, an
open-vocabulary manipulation system that integrates dynamics learning and
visual prompting through keypoints, leveraging both VLMs and learning-based
neural dynamics models. Our key insight is that a keypoint-based target
specification is simultaneously interpretable by VLMs and can be efficiently
translated into cost functions for model-based planning. Given language
instructions and visual observations, KUDA first assigns keypoints to the RGB
image and queries the VLM to generate target specifications. These abstract
keypoint-based representations are then converted into cost functions, which
are optimized using a learned dynamics model to produce robotic trajectories.
We evaluate KUDA on a range of manipulation tasks, including free-form language
instructions across diverse object categories, multi-object interactions, and
deformable or granular objects, demonstrating the effectiveness of our
framework. The project page is available at http://kuda-dynamics.github.io.
comment: Project website: http://kuda-dynamics.github.io
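A minimal sketch of the keypoint-to-cost translation, assuming the VLM has already produced target keypoint positions and a learned dynamics model predicts keypoints under a candidate action (the one-step "dynamics" below is a toy stand-in):

```python
import numpy as np

def keypoint_cost(predicted_kps, target_kps):
    """Planning cost: sum of squared distances between keypoints predicted
    by the dynamics model and the VLM-specified targets. Both (K, 3)."""
    return float(np.sum((predicted_kps - target_kps) ** 2))

def plan(actions, rollout_fn, target_kps):
    """Pick the candidate action whose rollout minimizes the keypoint cost.
    `rollout_fn(action) -> (K, 3)` stands in for the learned dynamics model."""
    return min(actions, key=lambda a: keypoint_cost(rollout_fn(a), target_kps))

# Toy usage: hypothetical dynamics that translates all keypoints by the action.
kps0 = np.zeros((4, 3))
target = np.ones((4, 3))
rollout = lambda a: kps0 + a
candidates = [np.zeros(3), np.ones(3) * 0.9, np.ones(3)]
print(plan(candidates, rollout, target))  # -> [1. 1. 1.]
```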
☆ Learning Robotic Policy with Imagined Transition: Mitigating the Trade-off between Robustness and Optimality
Existing quadrupedal locomotion learning paradigms usually rely on extensive
domain randomization to alleviate the sim2real gap and enhance robustness. This
approach trains policies over a wide range of environment parameters and sensor
noises so that they perform reliably under uncertainty. However, since optimal performance under
ideal conditions often conflicts with the need to handle worst-case scenarios,
there is a trade-off between optimality and robustness. This trade-off forces
the learned policy to prioritize stability in diverse and challenging
conditions over efficiency and accuracy in ideal ones, leading to overly
conservative behaviors that sacrifice peak performance. In this paper, we
propose a two-stage framework that mitigates this trade-off by integrating
policy learning with imagined transitions. This framework enhances the
conventional reinforcement learning (RL) approach by incorporating imagined
transitions as demonstrative inputs. These imagined transitions are derived
from an optimal policy and a dynamics model operating within an idealized
setting. Our findings indicate that this approach significantly mitigates the
negative impact that domain randomization has on existing RL algorithms. It
leads to accelerated training, reduced tracking errors within the distribution,
and enhanced robustness outside the distribution.
☆ World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning
Recent advances in large vision-language models (LVLMs) have shown promise
for embodied task planning, yet they struggle with fundamental challenges like
dependency constraints and efficiency. Existing approaches either solely
optimize action selection or leverage world models during inference,
overlooking the benefits of learning to model the world as a way to enhance
planning capabilities. We propose Dual Preference Optimization (D$^2$PO), a new
learning framework that jointly optimizes state prediction and action selection
through preference learning, enabling LVLMs to understand environment dynamics
for better planning. To automatically collect trajectories and stepwise
preference data without human annotation, we introduce a tree search mechanism
for broad exploration via trial-and-error. Extensive experiments on
VoTa-Bench demonstrate that our D$^2$PO-based method significantly outperforms
existing methods and GPT-4o when applied to Qwen2-VL (7B), LLaVA-1.6 (7B), and
LLaMA-3.2 (11B), achieving superior task success rates with more efficient
execution paths.
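The preference-learning component can be illustrated with a standard DPO-style pairwise loss, the common formulation such methods build on; this sketch assumes log-probabilities of preferred and dispreferred outputs are available, and D$^2$PO's exact dual objective over states and actions may differ.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style pairwise preference loss (sketch). In a dual formulation it
    would be applied both to action selection and to state prediction.

    logp_*: policy log-probs of the preferred (w) / dispreferred (l) output.
    ref_logp_*: the same quantities under a frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

loss = dpo_loss(torch.tensor([-1.0]), torch.tensor([-2.0]),
                torch.tensor([-1.5]), torch.tensor([-1.5]))
print(loss)  # ~0.645, below the indifference value ln(2) ~ 0.693
```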
☆ Stratified Topological Autonomy for Long-Range Coordination (STALC)
Cora A. Dimmig, Adam Goertz, Adam Polevoy, Mark Gonzales, Kevin C. Wolfe, Bradley Woosley, John Rogers, Joseph Moore
Achieving unified multi-robot coordination and motion planning in complex
environments is a challenging problem. In this paper, we present a hierarchical
approach to long-range coordination, which we call Stratified Topological
Autonomy for Long-Range Coordination (STALC). In particular, we look at the
problem of minimizing visibility to observers and maximizing safety with a
multi-robot team navigating through a hazardous environment. At its core, our
approach relies on the notion of a dynamic topological graph, where the edge
weights vary dynamically based on the locations of the robots in the graph. To
create this dynamic topological graph, we evaluate the visibility of the robot
team from a discrete set of observer locations (both adversarial and friendly),
and construct a topological graph whose edge weights depend on both adversary
position and robot team configuration. We then impose temporal constraints on
the evolution of those edge weights based on robot team state and use
Mixed-Integer Programming (MIP) to generate optimal multi-robot plans through
the graph. The visibility information also informs the lower layers of the
autonomy stack to plan minimal visibility paths through the environment for the
team of robots. Our approach includes methods to reduce the computational
complexity of planning for a team of robots that interact and coordinate to
accomplish a common goal. We demonstrate our approach in simulated and hardware
experiments in forested and urban environments.
comment: This work has been submitted to the IEEE for possible publication.
arXiv admin note: text overlap with arXiv:2303.11966
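The notion of a dynamic topological graph can be illustrated with a toy edge-weight function that penalizes proximity-based visibility to observers; the paper evaluates true visibility from discrete observer locations and solves a MIP over the resulting graph, neither of which is reproduced here.

```python
import numpy as np

def edge_weight(edge_midpoint, observers, base_cost=1.0, vis_penalty=5.0,
                vis_range=20.0):
    """Dynamic edge weight: base traversal cost plus a penalty per observer
    within a toy 'visibility' radius of the edge (stand-in for a real
    line-of-sight computation)."""
    d = np.linalg.norm(observers - edge_midpoint, axis=1)
    return base_cost + vis_penalty * np.sum(d < vis_range)

observers = np.array([[15.0, 0.0], [40.0, 40.0]])
print(edge_weight(np.array([10.0, 5.0]), observers))   # exposed edge: 6.0
print(edge_weight(np.array([80.0, 80.0]), observers))  # hidden edge: 1.0
```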
★ Finetuning Generative Trajectory Model with Reinforcement Learning from Human Feedback
Derun Li, Jianwei Ren, Yue Wang, Xin Wen, Pengxiang Li, Leimeng Xu, Kun Zhan, Zhongpu Xia, Peng Jia, Xianpeng Lang, Ningyi Xu, Hang Zhao
Generating human-like and adaptive trajectories is essential for autonomous
driving in dynamic environments. While generative models have shown promise in
synthesizing feasible trajectories, they often fail to capture the nuanced
variability of human driving styles due to dataset biases and distributional
shifts. To address this, we introduce TrajHF, a human feedback-driven
finetuning framework for generative trajectory models, designed to align motion
planning with diverse driving preferences. TrajHF incorporates a
multi-conditional denoiser and reinforcement learning from human feedback to
refine multi-modal trajectory generation beyond conventional imitation
learning. This enables better alignment with human driving preferences while
maintaining safety and feasibility constraints. TrajHF achieves a PDMS of 93.95
on the NavSim benchmark, significantly exceeding other methods. TrajHF sets a new
paradigm for personalized and adaptable trajectory generation in autonomous
driving.
comment: 10 pages, 5 figures
☆ A nonlinear real time capable motion cueing algorithm based on deep reinforcement learning
Hendrik Scheidel, Camilo Gonzalez, Houshyar Asadi, Tobias Bellmann, Andreas Seefried, Shady Mohamed, Saeid Nahavandi
In motion simulation, motion cueing algorithms (MCAs) are used for the
trajectory planning of the motion simulator platform (MSP), where workspace
limitations prevent direct reproduction of reference trajectories. Strategies
such as motion washout, which return the platform to its center, are crucial in
these settings. For serial robotic MSPs with highly nonlinear workspaces, it is
essential to maximize the efficient utilization of the MSP's kinematic and
dynamic capabilities. Traditional approaches, including classical washout
filtering and linear model predictive control, fail to consider
platform-specific, nonlinear properties, while nonlinear model predictive
control, though comprehensive, imposes high computational demands that hinder
real-time, pilot-in-the-loop application without further simplification. To
overcome these limitations, we introduce a novel approach using deep
reinforcement learning (DRL) for motion cueing, demonstrated here for the first
time in a 6-degree-of-freedom (DOF) setting with full consideration of the MSP's kinematic
nonlinearities. Previous work by the authors successfully demonstrated the
application of DRL to a simplified 2-DOF setup, which did not consider
kinematic or dynamic constraints. This approach has been extended to all 6 DOF
by incorporating a complete kinematic model of the MSP into the algorithm, a
crucial step for enabling its application on a real motion simulator. The
training of the DRL-MCA is based on Proximal Policy Optimization in an
actor-critic implementation combined with an automated hyperparameter
optimization. After detailing the necessary training framework and the
algorithm itself, we provide a comprehensive validation, demonstrating that the
DRL-MCA achieves competitive performance against established algorithms.
Moreover, it generates feasible trajectories by respecting all system
constraints and meets all real-time requirements with low...
☆ Compliant Control of Quadruped Robots for Assistive Load Carrying
Nimesh Khandelwal, Amritanshu Manu, Shakti S. Gupta, Mangal Kothari, Prashanth Krishnamurthy, Farshad Khorrami
This paper presents a novel method for assistive load carrying using
quadruped robots. The controller uses proprioceptive sensor data to estimate
the external base wrench, which is used for precise control of the robot's
acceleration during payload transport. The acceleration is controlled using a
combination of admittance control and Control Barrier Function (CBF) based
quadratic program (QP). The proposed controller rejects disturbances and
maintains consistent performance under varying load conditions. Additionally,
the built-in CBF guarantees collision avoidance with the collaborative agent in
front of the robot. The efficacy of the overall controller is shown by its
implementation on the physical hardware as well as numerical simulations. The
proposed control framework aims to enhance the quadruped robot's ability to
perform assistive tasks in various scenarios, from industrial applications to
search and rescue operations.
comment: 12 pages, 20 figures
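A 1-D sketch of the two ingredients: an admittance law mapping the estimated external wrench to a desired acceleration, followed by a scalar safety clamp standing in for the paper's CBF-based QP (the constants and the simplified barrier condition are illustrative assumptions).

```python
import numpy as np

def admittance_accel(f_ext, v, M=30.0, D=15.0):
    """Admittance law along the direction of travel: solve
    M * a_des + D * v = f_ext for the desired acceleration, where M and D
    are a virtual mass and damping (values chosen for illustration)."""
    return (f_ext - D * v) / M

def cbf_filter(a_des, d, v, alpha=1.0, a_max=2.0):
    """Minimal safety filter for a distance constraint d >= 0 to the
    collaborator in front. With barrier h = d and closing speed v, the CBF
    condition h_dot >= -alpha * h suggests v <= alpha * d; here we clamp the
    acceleration with a crude first-order bound instead of solving the
    paper's full quadratic program."""
    a_safe = alpha * d - v  # heuristic acceleration bound (assumption)
    return float(np.clip(a_des, -a_max, min(a_max, a_safe)))

a = admittance_accel(f_ext=60.0, v=0.5)  # e.g., a 60 N pull at 0.5 m/s
print(a, cbf_filter(a, d=0.4, v=0.5))    # braked to respect the barrier
```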
★ LUMOS: Language-Conditioned Imitation Learning with World Models ICRA
We introduce LUMOS, a language-conditioned multi-task imitation learning
framework for robotics. LUMOS learns skills by practicing them over many
long-horizon rollouts in the latent space of a learned world model and
transfers these skills zero-shot to a real robot. By learning on-policy in the
latent space of the learned world model, our algorithm mitigates policy-induced
distribution shift which most offline imitation learning methods suffer from.
LUMOS learns from unstructured play data with fewer than 1% hindsight language
annotations but is steerable with language commands at test time. We achieve
this coherent long-horizon performance by combining latent planning with both
image- and language-based hindsight goal relabeling during training, and by
optimizing an intrinsic reward defined in the latent space of the world model
over multiple time steps, effectively reducing covariate shift. In experiments
on the difficult long-horizon CALVIN benchmark, LUMOS outperforms prior
learning-based methods that use comparable approaches on chained multi-task
evaluations. To the best of our knowledge, we are the first to learn
language-conditioned continuous visuomotor control for a real-world robot
within an offline world model. Videos, dataset and code are available at
http://lumos.cs.uni-freiburg.de.
comment: Accepted at the 2025 IEEE International Conference on Robotics and
Automation (ICRA)
☆ Autonomous Robotic Radio Source Localization via a Novel Gaussian Mixture Filtering Approach
This study proposes a new Gaussian Mixture Filter (GMF) to improve the
estimation performance for the autonomous robotic radio signal source search
and localization problem in unknown environments. The proposed filter is first
tested on a benchmark numerical problem to validate its performance against
other state-of-practice approaches such as Particle Gaussian Mixture (PGM)
filters and the Particle Filter (PF). Then, the proposed approach is tested and
compared against PF and PGM filters in real-world robotic field experiments to
validate its impact for real-world robotic applications. The considered
real-world scenarios have partial observability with the range-only measurement
and uncertainty in the measurement model. The results show that the proposed
filter handles this partial observability effectively, achieving improved
performance over the PF, reduced computation requirements, and improved
robustness compared to the other techniques.
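To make the range-only setting concrete, here is the reweighting step of a Gaussian-mixture measurement update; the per-component mean/covariance updates (and the paper's specific GMF construction) are omitted.

```python
import numpy as np

def gm_update(means, weights, z, sensor_pos, r_std):
    """Reweighting step of a Gaussian-mixture update for a range-only
    measurement: scale each component weight by the Gaussian likelihood of
    the measured range z, then renormalize. means: (N, 2) source position
    hypotheses; sensor_pos: (2,) robot position; r_std: range noise std."""
    ranges = np.linalg.norm(means - sensor_pos, axis=1)
    lik = np.exp(-0.5 * ((z - ranges) / r_std) ** 2)
    w = weights * lik
    return w / w.sum()

means = np.array([[0.0, 0.0], [5.0, 0.0], [10.0, 0.0]])
w = gm_update(means, np.full(3, 1 / 3), z=5.2,
              sensor_pos=np.array([0.0, 0.0]), r_std=0.5)
print(w.round(3))  # mass concentrates on the hypothesis near range 5
```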
☆ HALO: Fault-Tolerant Safety Architecture For High-Speed Autonomous Racing
The field of high-speed autonomous racing has seen significant advances in
recent years, with the rise of competitions such as RoboRace and the Indy
Autonomous Challenge providing a platform for researchers to develop software
stacks for autonomous race vehicles capable of reaching speeds in excess of 170
mph. Ensuring the safety of these vehicles requires the software to
continuously monitor for different faults and erroneous operating conditions
during high-speed operation, with the goal of mitigating any unreasonable risks
posed by malfunctions in sub-systems and components. This paper presents a
comprehensive overview of the HALO safety architecture, which has been
implemented on a full-scale autonomous racing vehicle as part of the Indy
Autonomous Challenge. The paper begins with a failure mode and criticality
analysis of the perception, planning, control, and communication modules of the
software stack. Specifically, we examine three types of faults: node health,
data health, and behavioral-safety faults. To mitigate these faults,
the paper then outlines HALO safety archetypes and runtime monitoring methods.
Finally, the paper demonstrates the effectiveness of the HALO safety
architecture for each of the faults, through real-world data gathered from
autonomous racing vehicle trials during multi-agent scenarios.
comment: 27 pages, 7 figures
☆ Enhanced View Planning for Robotic Harvesting: Tackling Occlusions with Imitation Learning ICRA 2025
In agricultural automation, inherent occlusion presents a major challenge for
robotic harvesting. We propose a novel imitation learning-based viewpoint
planning approach to actively adjust camera viewpoint and capture unobstructed
images of the target crop. Traditional viewpoint planners and existing
learning-based methods, which depend on manually designed evaluation metrics or
reward functions, often struggle to generalize to complex, unseen scenarios.
Our method employs the Action Chunking with Transformer (ACT) algorithm to
learn effective camera motion policies from expert demonstrations. This enables
continuous six-degree-of-freedom (6-DoF) viewpoint adjustments that are
smoother and more precise and that better reveal occluded targets. Extensive experiments in
both simulated and real-world environments, featuring agricultural scenarios
and a 6-DoF robot arm equipped with an RGB-D camera, demonstrate our method's
superior success rate and efficiency, especially in complex occlusion
conditions, as well as its ability to generalize across different crops without
reprogramming. This study advances robotic harvesting by providing a practical
"learn from demonstration" (LfD) solution to occlusion challenges, ultimately
enhancing autonomous harvesting performance and productivity.
comment: Accepted at ICRA 2025
☆ OSMa-Bench: Evaluating Open Semantic Mapping Under Varying Lighting Conditions
Open Semantic Mapping (OSM) is a key technology in robotic perception,
combining semantic segmentation and SLAM techniques. This paper introduces a
dynamically configurable and highly automated LLM/LVLM-powered pipeline for
evaluating OSM solutions called OSMa-Bench (Open Semantic Mapping Benchmark).
The study focuses on evaluating state-of-the-art semantic mapping algorithms
under varying indoor lighting conditions, a critical challenge in indoor
environments. We introduce a novel dataset with simulated RGB-D sequences and
ground truth 3D reconstructions, facilitating the rigorous analysis of mapping
performance across different lighting conditions. Through experiments on
leading models such as ConceptGraphs, BBQ and OpenScene, we evaluate the
semantic fidelity of object recognition and segmentation. Additionally, we
introduce a Scene Graph evaluation method to analyze the ability of models to
interpret semantic structure. The results provide insights into the robustness
of these models, informing future research directions for developing resilient
and adaptable robotic systems. Our code is available at
https://be2rlab.github.io/OSMa-Bench/.
comment: Project page: https://be2rlab.github.io/OSMa-Bench/
☆ 6D Object Pose Tracking in Internet Videos for Robotic Manipulation ICLR 2025
Georgy Ponimatkin, Martin Cífka, Tomáš Souček, Médéric Fourmy, Yann Labbé, Vladimir Petrik, Josef Sivic
We seek to extract a temporally consistent 6D pose trajectory of a
manipulated object from an Internet instructional video. This is a challenging
set-up for current 6D pose estimation methods due to uncontrolled capturing
conditions, subtle but dynamic object motions, and the fact that the exact mesh
of the manipulated object is not known. To address these challenges, we present
the following contributions. First, we develop a new method that estimates the
6D pose of any object in the input image without prior knowledge of the object
itself. The method proceeds by (i) retrieving a CAD model similar to the
depicted object from a large-scale model database, (ii) 6D aligning the
retrieved CAD model with the input image, and (iii) grounding the absolute
scale of the object with respect to the scene. Second, we extract smooth 6D
object trajectories from Internet videos by carefully tracking the detected
objects across video frames. The extracted object trajectories are then
retargeted via trajectory optimization into the configuration space of a
robotic manipulator. Third, we thoroughly evaluate and ablate our 6D pose
estimation method on YCB-V and HOPE-Video datasets as well as a new dataset of
instructional videos manually annotated with approximate 6D object
trajectories. We demonstrate significant improvements over existing
state-of-the-art RGB 6D pose estimation methods. Finally, we show that the 6D
object motion estimated from Internet videos can be transferred to a 7-axis
robotic manipulator both in a virtual simulator as well as in a real-world
set-up. We also successfully apply our method to egocentric videos taken from
the EPIC-KITCHENS dataset, demonstrating potential for Embodied AI
applications.
comment: Accepted to ICLR 2025. Project page available at
https://ponimatkin.github.io/wildpose/
☆ CODEI: Resource-Efficient Task-Driven Co-Design of Perception and Decision Making for Mobile Robots Applied to Autonomous Vehicles
This paper discusses the integration challenges and strategies for designing
mobile robots, by focusing on the task-driven, optimal selection of hardware
and software to balance safety, efficiency, and minimal usage of resources such
as costs, energy, computational requirements, and weight. We emphasize the
interplay between perception and motion planning in decision-making by
introducing the concept of occupancy queries to quantify the perception
requirements for sampling-based motion planners. Sensor and algorithm
performance are evaluated using False Negative Rates (FNR) and False Positive
Rates (FPR) across various factors such as geometric relationships, object
properties, sensor resolution, and environmental conditions. By integrating
perception requirements with perception performance, an Integer Linear
Programming (ILP) approach is proposed for efficient sensor and algorithm
selection and placement. This forms the basis for a co-design optimization that
includes the robot body, motion planner, perception pipeline, and computing
unit. We refer to this framework for solving the co-design problem of mobile
robots as CODEI, short for Co-design of Embodied Intelligence. A case study on
developing an Autonomous Vehicle (AV) for urban scenarios provides actionable
information for designers and shows that complex tasks escalate resource
demands, with task performance requirements affecting the choice of autonomy stack. The
study demonstrates that resource prioritization influences sensor choice:
cameras are preferred for cost-effective and lightweight designs, while lidar
sensors are chosen for better energy and computational efficiency.
comment: 20 pages, 33 images, IEEE Transactions on Robotics
★ SurgRAW: Multi-Agent Workflow with Chain-of-Thought Reasoning for Surgical Intelligence
Integration of Vision-Language Models (VLMs) in surgical intelligence is
hindered by hallucinations, domain knowledge gaps, and limited understanding of
task interdependencies within surgical scenes, undermining clinical
reliability. While recent VLMs demonstrate strong general reasoning and
thinking capabilities, they still lack the domain expertise and task-awareness
required for precise surgical scene interpretation. Although Chain-of-Thought
(CoT) can structure reasoning more effectively, current approaches rely on
self-generated CoT steps, which often exacerbate inherent domain gaps and
hallucinations. To overcome this, we present SurgRAW, a CoT-driven multi-agent
framework that delivers transparent, interpretable insights for most tasks in
robotic-assisted surgery. By employing specialized CoT prompts across five
tasks: instrument recognition, action recognition, action prediction, patient
data extraction, and outcome assessment, SurgRAW mitigates hallucinations
through structured, domain-aware reasoning. Retrieval-Augmented Generation
(RAG) is also integrated to incorporate external medical knowledge, bridging
domain gaps and improving response reliability. Most importantly, a
hierarchical agentic system ensures that CoT-embedded VLM agents collaborate
effectively while understanding task interdependencies, with a panel discussion
mechanism promoting logical consistency. To evaluate our method, we introduce
SurgCoTBench, the first reasoning-based dataset with structured frame-level
annotations. With comprehensive experiments, we demonstrate the effectiveness
of the proposed SurgRAW, achieving a 29.32% accuracy improvement over baseline VLMs on 12
robotic procedures, achieving the state-of-the-art performance and advancing
explainable, trustworthy, and autonomous surgical assistance.
☆ SCOOP: A Framework for Proactive Collaboration and Social Continual Learning through Natural Language Interaction and Causal Reasoning
Dimitri Ognibene, Sabrina Patania, Luca Annese, Cansu Koyuturk, Franca Garzotto, Giuseppe Vizzari, Azzurra Ruggeri, Simone Colombani
Multimodal information-gathering settings, where users collaborate with AI in
dynamic environments, are increasingly common. These involve complex processes
with textual and multimodal interactions, often requiring additional structural
information via cost-incurring requests. AI helpers lack access to users' true
goals, beliefs, and preferences and struggle to integrate diverse information
effectively.
We propose a social continual learning framework for causal knowledge
acquisition and collaborative decision-making. It focuses on autonomous agents
learning through dialogues, question-asking, and interaction in open, partially
observable environments. A key component is a natural language oracle that
answers the agent's queries about environmental mechanisms and states, refining
causal understanding while balancing exploration (learning) and exploitation
(knowledge use).
Evaluation tasks inspired by developmental psychology emphasize causal
reasoning and question-asking skills. They complement benchmarks by assessing
the agent's ability to identify knowledge gaps, generate meaningful queries,
and incrementally update reasoning. The framework also evaluates how knowledge
acquisition costs are amortized across tasks within the same environment.
We propose two architectures: 1) a system combining Large Language Models
(LLMs) with the ReAct framework and question-generation, and 2) an advanced
system with a causal world model (symbolic, graph-based, or subsymbolic) for
reasoning and decision-making. The latter builds a causal knowledge graph for
efficient inference and adaptability under constraints. Challenges include
integrating causal reasoning into ReAct and optimizing exploration and
question-asking in error-prone scenarios. Beyond applications, this framework
models developmental processes combining causal reasoning, question generation,
and social learning.
comment: 5 pages
☆ PRISM: Preference Refinement via Implicit Scene Modeling for 3D Vision-Language Preference-Based Reinforcement Learning
We propose PRISM, a novel framework designed to overcome the limitations of
2D-based Preference-Based Reinforcement Learning (PBRL) by unifying 3D point
cloud modeling and future-aware preference refinement. At its core, PRISM
adopts a 3D Point Cloud-Language Model (3D-PC-LLM) to mitigate occlusion and
viewpoint biases, ensuring more stable and spatially consistent preference
signals. Additionally, PRISM leverages Chain-of-Thought (CoT) reasoning to
incorporate long-horizon considerations, thereby preventing the short-sighted
feedback often seen in static preference comparisons. In contrast to
conventional PBRL techniques, this integration of 3D perception and
future-oriented reasoning leads to significant gains in preference agreement
rates, faster policy convergence, and robust generalization across unseen
robotic environments. Our empirical results, spanning tasks such as robotic
manipulation and autonomous navigation, highlight PRISM's potential for
real-world applications where precise spatial understanding and reliable
long-term decision-making are critical. By bridging 3D geometric awareness with
CoT-driven preference modeling, PRISM establishes a comprehensive foundation
for scalable, human-aligned reinforcement learning.
★ GS-SDF: LiDAR-Augmented Gaussian Splatting and Neural SDF for Geometrically Consistent Rendering and Reconstruction
Digital twins are fundamental to the development of autonomous driving and
embodied artificial intelligence. However, achieving high-granularity surface
reconstruction and high-fidelity rendering remains a challenge. Gaussian
splatting offers efficient photorealistic rendering but struggles with
geometric inconsistencies due to fragmented primitives and sparse observational
data in robotics applications. Existing regularization methods, which rely on
render-derived constraints, often fail in complex environments. Moreover,
effectively integrating sparse LiDAR data with Gaussian splatting remains
challenging. We propose a unified LiDAR-visual system that synergizes Gaussian
splatting with a neural signed distance field. Accurate LiDAR point clouds
enable a trained neural signed distance field to provide a manifold geometry
field. This motivates an SDF-based Gaussian initialization for physically
grounded primitive placement and a comprehensive geometric regularization for
geometrically consistent rendering and reconstruction.
Experiments demonstrate superior reconstruction accuracy and rendering quality
across diverse trajectories. To benefit the community, the codes will be
released at https://github.com/hku-mars/GS-SDF.
☆ Mapless Collision-Free Flight via MPC using Dual KD-Trees in Cluttered Environments
Collision-free flight in cluttered environments is a critical capability for
autonomous quadrotors. Traditional methods often rely on detailed 3D map
construction, trajectory generation, and tracking. However, this cascade
pipeline can introduce accumulated errors and computational delays, limiting
flight agility and safety. In this paper, we propose a novel method for
enabling collision-free flight in cluttered environments without explicitly
constructing 3D maps or generating and tracking collision-free trajectories.
Instead, we leverage Model Predictive Control (MPC) to directly produce safe
actions from sparse waypoints and point clouds from a depth camera. These
sparse waypoints are dynamically adjusted online based on nearby obstacles
detected from point clouds. To achieve this, we introduce a dual KD-Tree
mechanism: the Obstacle KD-Tree quickly identifies the nearest obstacle for
avoidance, while the Edge KD-Tree provides a robust initial guess for the MPC
solver, preventing it from getting stuck in local minima during obstacle
avoidance. We validate our approach through extensive simulations and
real-world experiments. The results show that our approach significantly
outperforms the mapping-based methods and is also superior to imitation
learning-based methods, demonstrating reliable obstacle avoidance at up to 12
m/s in simulations and 6 m/s in real-world tests. Our method provides a simple
and robust alternative to existing methods.
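The dual KD-tree query pattern is easy to sketch with SciPy; how the edge points are constructed and how the MPC cost consumes these queries are paper-specific details not shown here.

```python
import numpy as np
from scipy.spatial import cKDTree

# Point cloud from the depth camera (world frame), plus a set of "edge"
# points assumed to be candidate free-space samples near obstacle borders.
obstacles = np.random.rand(500, 3) * 10.0
edge_points = np.random.rand(200, 3) * 10.0

obstacle_tree = cKDTree(obstacles)   # nearest obstacle for avoidance terms
edge_tree = cKDTree(edge_points)     # warm-start guesses for the MPC solver

def mpc_inputs(position):
    """Query both trees once per MPC iteration (illustrative sketch)."""
    d_obs, i_obs = obstacle_tree.query(position)  # closest obstacle point
    _, i_edge = edge_tree.query(position)         # nearby warm-start point
    return obstacles[i_obs], d_obs, edge_points[i_edge]

nearest_obs, dist, warm_start = mpc_inputs(np.array([5.0, 5.0, 2.0]))
print(dist, warm_start)
```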
☆ A Real-Sim-Real (RSR) Loop Framework for Generalizable Robotic Policy Transfer with Differentiable Simulation
Lu Shi, Yuxuan Xu, Shiyu Wang, Jinhao Huang, Wenhao Zhao, Yufei Jia, Zike Yan, Weibin Gu, Guyue Zhou
The sim-to-real gap remains a critical challenge in robotics, hindering the
deployment of algorithms trained in simulation to real-world systems. This
paper introduces a novel Real-Sim-Real (RSR) loop framework leveraging
differentiable simulation to address this gap by iteratively refining
simulation parameters, aligning them with real-world conditions, and enabling
robust and efficient policy transfer. A key contribution of our work is the
design of an informative cost function that encourages the collection of
diverse and representative real-world data, minimizing bias and maximizing the
utility of each data point for simulation refinement. This cost function
integrates seamlessly into existing reinforcement learning algorithms (e.g.,
PPO, SAC) and ensures a balanced exploration of critical regions in the real
domain. Furthermore, our approach is implemented on the versatile Mujoco MJX
platform, and our framework is compatible with a wide range of robotic systems.
Experimental results on several robotic manipulation tasks demonstrate that our
method significantly reduces the sim-to-real gap, achieving high task
performance and generalizability across diverse scenarios of both explicit and
implicit environmental uncertainties.
☆ IMPACT: Intelligent Motion Planning with Acceptable Contact Trajectories via Vision-Language Models
Motion planning involves determining a sequence of robot configurations to
reach a desired pose, subject to movement and safety constraints. Traditional
motion planning finds collision-free paths, but this is overly restrictive in
clutter, where it may not be possible for a robot to accomplish a task without
contact. In addition, contacts range from relatively benign (e.g., brushing a
soft pillow) to more dangerous (e.g., toppling a glass vase). Due to this
diversity, it is difficult to characterize which contacts may be acceptable or
unacceptable. In this paper, we propose IMPACT, a novel motion planning
framework that uses Vision-Language Models (VLMs) to infer environment
semantics, identifying which parts of the environment can best tolerate contact
based on object properties and locations. Our approach uses the VLM's outputs
to produce a dense 3D "cost map" that encodes contact tolerances and seamlessly
integrates with standard motion planners. We perform experiments using 20
simulation and 10 real-world scenes and assess using task success rate, object
displacements, and feedback from human evaluators. Our results over 3620
simulation and 200 real-world trials suggest that IMPACT enables efficient
contact-rich motion planning in cluttered settings while outperforming
alternative methods and ablations. Supplementary material is available at
https://impact-planning.github.io/.
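A toy illustration of how a dense contact-tolerance cost map can plug into a standard planner's edge cost; the grid values and weighting below are invented for the example, whereas the paper derives the map from VLM outputs.

```python
import numpy as np

# Hypothetical contact-tolerance cost map on a coarse 3D grid:
# 0 = free space, low values = benign contact (e.g., a soft pillow),
# high values = unacceptable contact (e.g., a glass vase).
cost_map = np.zeros((50, 50, 50))
cost_map[10:12, 20:22, 0:5] = 0.2    # pillow region
cost_map[30:32, 30:32, 0:8] = 10.0   # vase region

def edge_cost(p, q, resolution=0.1, contact_weight=1.0, samples=10):
    """Path-length cost plus mean contact cost sampled along the edge, so a
    sampling-based planner trades detours against tolerated contact."""
    pts = np.linspace(p, q, samples)
    idx = np.clip((pts / resolution).astype(int), 0, 49)
    contact = cost_map[idx[:, 0], idx[:, 1], idx[:, 2]].mean()
    return np.linalg.norm(q - p) + contact_weight * contact

print(edge_cost(np.array([0.5, 1.0, 0.1]), np.array([1.5, 2.5, 0.3])))
```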
☆ AhaRobot: A Low-Cost Open-Source Bimanual Mobile Manipulator for Embodied AI
Navigation and manipulation in open-world environments remain unsolved
challenges in Embodied AI. The high cost of commercial mobile manipulation
robots significantly limits research in real-world scenes. To address this
issue, we propose AhaRobot, a low-cost and fully open-source dual-arm mobile
manipulation robot system with a hardware cost of only $1,000 (excluding
optional computational resources), which is less than 1/15 of the cost of
popular mobile robots. The AhaRobot system consists of three components: (1) a
novel low-cost hardware architecture primarily composed of off-the-shelf
components, (2) an optimized control solution to enhance operational precision
integrating dual-motor backlash control and static friction compensation, and
(3) a simple remote teleoperation method RoboPilot. We use handles to control
the dual arms and pedals for whole-body movement. The teleoperation process is
low-burden and easy to operate, much like piloting. RoboPilot is designed for
remote data collection in embodied scenarios. Experimental results demonstrate
that RoboPilot significantly enhances data collection efficiency in complex
manipulation tasks, achieving a 30% increase compared to methods using a 3D
mouse and leader-follower systems. It also excels at completing extremely
long-horizon tasks in one go. Furthermore, AhaRobot can be used to learn
end-to-end policies and autonomously perform complex manipulation tasks, such
as pen insertion and cleaning up the floor. We aim to build an affordable yet
powerful platform to promote the development of embodied tasks on real devices,
advancing more robust and reliable embodied AI. All hardware and software
systems are available at https://aha-robot.github.io.
comment: The first two authors contributed equally. Website:
https://aha-robot.github.io
☆ SmartWay: Enhanced Waypoint Prediction and Backtracking for Zero-Shot Vision-and-Language Navigation
Vision-and-Language Navigation (VLN) in continuous environments requires
agents to interpret natural language instructions while navigating
unconstrained 3D spaces. Existing VLN-CE frameworks rely on a two-stage
approach: a waypoint predictor to generate waypoints and a navigator to execute
movements. However, current waypoint predictors struggle with spatial
awareness, while navigators lack historical reasoning and backtracking
capabilities, limiting adaptability. We propose a zero-shot VLN-CE framework
integrating an enhanced waypoint predictor with a Multi-modal Large Language
Model (MLLM)-based navigator. Our predictor employs a stronger vision encoder,
masked cross-attention fusion, and an occupancy-aware loss for better waypoint
quality. The navigator incorporates history-aware reasoning and adaptive path
planning with backtracking, improving robustness. Experiments on R2R-CE and
MP3D benchmarks show our method achieves state-of-the-art (SOTA) performance in
zero-shot settings, demonstrating competitive results compared to fully
supervised methods. Real-world validation on Turtlebot 4 further highlights its
adaptability.
☆ V2X-ReaLO: An Open Online Framework and Dataset for Cooperative Perception in Reality
Hao Xiang, Zhaoliang Zheng, Xin Xia, Seth Z. Zhao, Letian Gao, Zewei Zhou, Tianhui Cai, Yun Zhang, Jiaqi Ma
Cooperative perception enabled by Vehicle-to-Everything (V2X) communication
holds significant promise for enhancing the perception capabilities of
autonomous vehicles, allowing them to overcome occlusions and extend their
field of view. However, existing research predominantly relies on simulated
environments or static datasets, leaving the feasibility and effectiveness of
V2X cooperative perception, especially intermediate fusion, in real-world
scenarios largely unexplored. In this work, we introduce V2X-ReaLO, an open
online cooperative perception framework deployed on real vehicles and smart
infrastructure that integrates early, late, and intermediate fusion methods
within a unified pipeline and provides the first practical demonstration of
online intermediate fusion's feasibility and performance under genuine
real-world conditions. Additionally, we present an open benchmark dataset
specifically designed to assess the performance of online cooperative
perception systems. This new dataset extends the V2X-Real dataset to dynamic,
synchronized ROS bags and provides 25,028 test frames with 6,850 annotated key
frames in challenging urban scenarios. By enabling real-time assessments of
perception accuracy and communication latency under dynamic conditions,
V2X-ReaLO sets a new benchmark for advancing and optimizing cooperative
perception systems in real-world applications. The codes and datasets will be
released to further advance the field.
☆ LEVA: A high-mobility logistic vehicle with legged suspension ICRA
Marco Arnold, Lukas Hildebrandt, Kaspar Janssen, Efe Ongan, Pascal Bürge, Ádám Gyula Gábriel, James Kennedy, Rishi Lolla, Quanisha Oppliger, Micha Schaaf, Joseph Church, Michael Fritsche, Victor Klemm, Turcan Tuna, Giorgio Valsecchi, Cedric Weibel, Marco Hutter
The autonomous transportation of materials over challenging terrain is a
problem with major economic implications that remains unsolved. This paper
introduces LEVA, a high-payload, high-mobility robot designed for autonomous
logistics across varied terrains, including those typical in agriculture,
construction, and search and rescue operations. LEVA uniquely integrates an
advanced legged suspension system using parallel kinematics. It is capable of
traversing stairs using an RL controller, has steerable wheels, and includes a
specialized box pickup mechanism that enables autonomous payload loading as
well as precise and reliable cargo transportation of up to 85 kg across uneven
surfaces, steps, and inclines while maintaining a cost of transport (CoT) as low as 0.15. Through
extensive experimental validation, LEVA demonstrates its off-road capabilities
and reliability regarding payload loading and transport.
comment: Accepted for publication at the 2025 IEEE International Conference on
Robotics and Automation (ICRA). This is the author's preprint version. 6
pages, 8 figures, 2 tables
☆ Post-disaster building indoor damage and survivor detection using autonomous path planning and deep learning with unmanned aerial vehicles
Rapid response to natural disasters such as earthquakes is a crucial element
in ensuring the safety of civil infrastructures and minimizing casualties.
Traditional manual inspection is labour-intensive, time-consuming, and can be
dangerous for inspectors and rescue workers. This paper proposes an autonomous
approach for structural damage inspection and survivor detection in
post-disaster indoor building scenarios, which incorporates an autonomous
navigation method, deep learning-based damage and survivor detection method,
and a customized low-cost micro aerial vehicle (MAV) with onboard sensors.
Experimental studies in a pseudo-post-disaster office building have shown the
proposed methodology can achieve high accuracy in structural damage inspection
and survivor detection. Overall, the proposed inspection approach shows great
potential to improve the efficiency of existing manual post-disaster building
inspection.
comment: 10 pages, 9 figures, accepted in the International Association for
Bridge and Structural Engineering (IABSE) Symposium 2025, Tokyo, Japan
☆ ES-Parkour: Advanced Robot Parkour with Bio-inspired Event Camera and Spiking Neural Network
In recent years, quadruped robotics has advanced significantly, particularly
in perception and motion control via reinforcement learning, enabling complex
motions in challenging environments. Visual sensors like depth cameras enhance
stability and robustness but face limitations, such as low operating
frequencies relative to joint control and sensitivity to lighting, which hinder
outdoor deployment. Additionally, deep neural networks in sensor and control
systems increase computational demands. To address these issues, we introduce
spiking neural networks (SNNs) and event cameras to perform a challenging
quadruped parkour task. Event cameras capture dynamic visual data, while SNNs
efficiently process spike sequences, mimicking biological perception.
Experimental results demonstrate that this approach significantly outperforms
traditional models, achieving excellent parkour performance with just 11.7% of
the energy consumption of an artificial neural network (ANN)-based model,
yielding an 88.3% energy reduction. By integrating event cameras with SNNs, our
work advances robotic reinforcement learning and opens new possibilities for
applications in demanding environments.
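For readers unfamiliar with SNNs, a discrete leaky integrate-and-fire (LIF) layer, the basic unit such networks are built from, can be written in a few lines (generic textbook dynamics, not the paper's architecture):

```python
import numpy as np

def lif_step(v, spikes_in, w, decay=0.9, v_th=1.0):
    """One discrete step of a leaky integrate-and-fire (LIF) layer.

    v: (N,) membrane potentials; spikes_in: (M,) binary input spikes
    (e.g., from an event camera); w: (N, M) synaptic weights.
    """
    v = decay * v + w @ spikes_in           # leak + integrate input current
    spikes_out = (v >= v_th).astype(float)  # fire where threshold is crossed
    v = v * (1.0 - spikes_out)              # hard reset of fired neurons
    return v, spikes_out

rng = np.random.default_rng(1)
v = np.zeros(8)
w = rng.normal(0.0, 0.6, size=(8, 16))
for _ in range(5):  # feed a short stream of random event frames
    v, s = lif_step(v, (rng.random(16) < 0.3).astype(float), w)
print(s)
```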
☆ RMG: Real-Time Expressive Motion Generation with Self-collision Avoidance for 6-DOF Companion Robotic Arms
The six-degree-of-freedom (6-DOF) robotic arm has gained widespread
application in human-coexisting environments. While previous research has
predominantly focused on functional motion generation, the critical aspect of
expressive motion in human-robot interaction remains largely unexplored. This
paper presents a novel real-time motion generation planner that enhances
interactivity by creating expressive robotic motions between arbitrary start
and end states within predefined time constraints. Our approach involves three
key contributions: first, we develop a mapping algorithm to construct an
expressive motion dataset derived from human dance movements; second, we train
motion generation models in both Cartesian and joint spaces using this dataset;
third, we introduce an optimization algorithm that guarantees smooth,
collision-free motion while maintaining the intended expressive style.
Experimental results demonstrate the effectiveness of our method, which can
generate expressive and generalized motions in under 0.5 seconds while
satisfying all specified constraints.
☆ PanoGen++: Domain-Adapted Text-Guided Panoramic Environment Generation for Vision-and-Language Navigation
Vision-and-language navigation (VLN) tasks require agents to navigate
three-dimensional environments guided by natural language instructions,
offering substantial potential for diverse applications. However, the scarcity
of training data impedes progress in this field. This paper introduces
PanoGen++, a novel framework that addresses this limitation by generating
varied and pertinent panoramic environments for VLN tasks. PanoGen++
incorporates pre-trained diffusion models with domain-specific fine-tuning,
employing parameter-efficient techniques such as low-rank adaptation to
minimize computational costs. We investigate two settings for environment
generation: masked image inpainting and recursive image outpainting. The former
maximizes novel environment creation by inpainting masked regions based on
textual descriptions, while the latter facilitates agents' learning of spatial
relationships within panoramas. Empirical evaluations on room-to-room (R2R),
room-for-room (R4R), and cooperative vision-and-dialog navigation (CVDN)
datasets reveal significant performance enhancements: a 2.44% increase in
success rate on the R2R test leaderboard, a 0.63% improvement on the R4R
validation unseen set, and a 0.75-meter enhancement in goal progress on the
CVDN validation unseen set. PanoGen++ augments the diversity and relevance of
training environments, resulting in improved generalization and efficacy in VLN
tasks.
comment: This paper was accepted by Neural Networks
♻ ☆ PCLA: A Framework for Testing Autonomous Agents in the CARLA Simulator
Recent research on testing autonomous driving agents has grown significantly,
especially in simulation environments. The CARLA simulator is often the
preferred choice, and the autonomous agents from the CARLA Leaderboard
challenge are regarded as the best-performing agents within this environment.
However, researchers who test these agents, rather than training their own
from scratch, often face challenges in utilizing them within customized test
environments and scenarios. To address these challenges, we introduce PCLA
(Pretrained CARLA Leaderboard Agents), an open-source Python testing framework
that includes nine high-performing pre-trained autonomous agents from the
Leaderboard challenges. PCLA is the first infrastructure specifically designed
for testing various autonomous agents in arbitrary CARLA
environments/scenarios. PCLA provides a simple way to deploy Leaderboard agents
onto a vehicle without relying on the Leaderboard codebase; it allows
researchers to easily switch between agents without requiring modifications to
CARLA versions or programming environments; and it is fully compatible with the
latest version of CARLA while remaining independent of the Leaderboard's
specific CARLA version. PCLA is publicly accessible at
https://github.com/MasoudJTehrani/PCLA.
comment: This work will be published at the FSE 2025 demonstration track
♻ ☆ 2HandedAfforder: Learning Precise Actionable Bimanual Affordances from Human Videos
When interacting with objects, humans effectively reason about which regions
of objects are viable for an intended action, i.e., the affordance regions of
the object. They can also account for subtle differences in object regions
based on the task to be performed and whether one or two hands need to be used.
However, current vision-based affordance prediction methods often reduce the
problem to naive object part segmentation. In this work, we propose a framework
for extracting affordance data from human activity video datasets. Our
extracted 2HANDS dataset contains precise object affordance region
segmentations and affordance class-labels as narrations of the activity
performed. The data also accounts for bimanual actions, i.e., two hands
co-ordinating and interacting with one or more objects. We present a VLM-based
affordance prediction model, 2HandedAfforder, trained on the dataset and
demonstrate superior performance over baselines in affordance region
segmentation for various activities. Finally, we show that our predicted
affordance regions are actionable, i.e., can be used by an agent performing a
task, through demonstration in robotic manipulation scenarios.
comment: Project site: https://sites.google.com/view/2handedafforder
♻ ☆ HumanoidPano: Hybrid Spherical Panoramic-LiDAR Cross-Modal Perception for Humanoid Robots
Qiang Zhang, Zhang Zhang, Wei Cui, Jingkai Sun, Jiahang Cao, Yijie Guo, Gang Han, Wen Zhao, Jiaxu Wang, Chenghao Sun, Lingfeng Zhang, Hao Cheng, Yujie Chen, Lin Wang, Jian Tang, Renjing Xu
The perceptual system design for humanoid robots poses unique challenges due
to inherent structural constraints that cause severe self-occlusion and limited
field-of-view (FOV). We present HumanoidPano, a novel hybrid cross-modal
perception framework that synergistically integrates panoramic vision and LiDAR
sensing to overcome these limitations. Unlike conventional robot perception
systems that rely on monocular cameras or standard multi-sensor configurations,
our method establishes geometrically-aware modality alignment through a
spherical vision transformer, enabling seamless fusion of 360° visual context
with LiDAR's precise depth measurements. First, Spherical Geometry-aware
Constraints (SGC) leverage panoramic camera ray properties to guide
distortion-regularized sampling offsets for geometric alignment. Second,
Spatial Deformable Attention (SDA) aggregates hierarchical 3D features via
spherical offsets, enabling efficient 360°-to-BEV fusion with
geometrically complete object representations. Third, Panoramic Augmentation
(AUG) combines cross-view transformations and semantic alignment to enhance
BEV-panoramic feature consistency during data augmentation. Extensive
evaluations demonstrate state-of-the-art performance on the 360BEV-Matterport
benchmark. Real-world deployment on humanoid platforms validates the system's
capability to generate accurate BEV segmentation maps through panoramic-LiDAR
co-perception, directly enabling downstream navigation tasks in complex
environments. Our work establishes a new paradigm for embodied perception in
humanoid robotics.
comment: Technical Report
♻ ☆ PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability
Understanding the environment and a robot's physical reachability is crucial
for task execution. While state-of-the-art vision-language models (VLMs) excel
in environmental perception, they often generate inaccurate or impractical
responses in embodied visual reasoning tasks due to a lack of understanding of
robotic physical reachability. To address this issue, we propose a unified
representation of physical reachability across diverse robots, i.e.,
Space-Physical Reachability Map (S-P Map), and PhysVLM, a vision-language model
that integrates this reachability information into visual reasoning.
Specifically, the S-P Map abstracts a robot's physical reachability into a
generalized spatial representation, independent of specific robot
configurations, allowing the model to focus on reachability features rather
than robot-specific parameters. Subsequently, PhysVLM extends traditional VLM
architectures by incorporating an additional feature encoder to process the S-P
Map, enabling the model to reason about physical reachability without
compromising its general vision-language capabilities. To train and evaluate
PhysVLM, we constructed a large-scale multi-robot dataset, Phys100K, and a
challenging benchmark, EQA-phys, which includes tasks for six different robots
in both simulated and real-world environments. Experimental results demonstrate
that PhysVLM outperforms existing models, achieving a 14% improvement over
GPT-4o on EQA-phys and surpassing advanced embodied VLMs such as RoboMamba and
SpatialVLM on the RoboVQA-val and OpenEQA benchmarks. Additionally, the S-P Map
shows strong compatibility with various VLMs, and its integration into
GPT-4o-mini yields a 7.1% performance improvement.
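As an illustration of the dual-encoder design the abstract describes, here is a minimal sketch of one plausible way to feed a reachability-map encoder alongside a standard vision encoder. All module names and the concatenation scheme are assumptions for illustration, not PhysVLM's actual architecture.

```python
import torch
import torch.nn as nn

class ReachabilityAugmentedVLM(nn.Module):
    """Hypothetical sketch: image tokens and S-P-Map-style reachability
    tokens come from separate encoders and are concatenated before the
    language model. Not the actual PhysVLM implementation."""

    def __init__(self, vision_encoder, map_encoder, llm):
        super().__init__()
        self.vision_encoder = vision_encoder  # image -> (B, N_img, dim)
        self.map_encoder = map_encoder        # S-P map -> (B, N_map, dim)
        self.llm = llm                        # token sequence -> logits

    def forward(self, image, sp_map, text_tokens):
        img_tok = self.vision_encoder(image)
        map_tok = self.map_encoder(sp_map)
        # prepend both modalities to the text tokens
        return self.llm(torch.cat([img_tok, map_tok, text_tokens], dim=1))
```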
♻ ☆ ForceGrip: Data-Free Curriculum Learning for Realistic Grip Force Control in VR Hand Manipulation
Realistic hand manipulation is a key component of immersive virtual reality
(VR), yet existing methods often rely on a kinematic approach or motion-capture
datasets that omit crucial physical attributes such as contact forces and
finger torques. Consequently, these approaches prioritize tight,
one-size-fits-all grips rather than reflecting users' intended force levels. We
present ForceGrip, a deep learning agent that synthesizes realistic hand
manipulation motions, faithfully reflecting the user's grip force intention.
Instead of mimicking predefined motion datasets, ForceGrip uses generated
training scenarios (randomizing object shapes, wrist movements, and trigger
input flows) to challenge the agent with a broad spectrum of physical
interactions. To effectively learn from these complex tasks, we employ a
three-phase curriculum learning framework comprising Finger Positioning,
Intention Adaptation, and Dynamic Stabilization. This progressive strategy
ensures stable hand-object contact, adaptive force control based on user
inputs, and robust handling under dynamic conditions. Additionally, a proximity
reward function enhances natural finger motions and accelerates training
convergence. Quantitative and qualitative evaluations reveal ForceGrip's
superior force controllability and plausibility compared to state-of-the-art
methods. The video presentation of our paper is accessible at
https://youtu.be/lR-YAfninJw.
comment: 19 pages, 10 figs (with appendix). Demo Video:
https://youtu.be/lR-YAfninJw
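The proximity reward mentioned in the abstract can be illustrated with a simple distance-based shaping term. The kernel width and helper names below are assumptions, not the paper's exact formulation.

```python
import numpy as np

def proximity_reward(finger_tips, surface_points, sigma=0.02):
    """Hypothetical shaping term: reward fingertips (M, 3) for being close
    to sampled object-surface points (N, 3), encouraging early contact."""
    # distance from each fingertip to its nearest surface point
    d = np.linalg.norm(finger_tips[:, None, :] - surface_points[None, :, :],
                       axis=-1).min(axis=1)
    return float(np.mean(np.exp(-(d / sigma) ** 2)))
```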
♻ ★ AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems
AgiBot-World-Contributors, Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xu Huang, Shu Jiang, Yuxin Jiang, Cheng Jing, Hongyang Li, Jialu Li, Chiming Liu, Yi Liu, Yuxiang Lu, Jianlan Luo, Ping Luo, Yao Mu, Yuehan Niu, Yixuan Pan, Jiangmiao Pang, Yu Qiao, Guanghui Ren, Cheng Ruan, Jiaqi Shan, Yongjian Shen, Chengshi Shi, Mingkang Shi, Modi Shi, Chonghao Sima, Jianheng Song, Huijie Wang, Wenhao Wang, Dafeng Wei, Chengen Xie, Guo Xu, Junchi Yan, Cunbiao Yang, Lei Yang, Shukai Yang, Maoqing Yao, Jia Zeng, Chi Zhang, Qinglin Zhang, Bin Zhao, Chengyue Zhao, Jiaqi Zhao, Jianchao Zhu
We explore how scalable robot data can address real-world challenges for
generalized robotic manipulation. Introducing AgiBot World, a large-scale
platform comprising over 1 million trajectories across 217 tasks in five
deployment scenarios, we achieve an order-of-magnitude increase in data scale
compared to existing datasets. Accelerated by a standardized collection
pipeline with human-in-the-loop verification, AgiBot World guarantees
high-quality and diverse data distribution. It is extensible from grippers to
dexterous hands and visuo-tactile sensors for fine-grained skill acquisition.
Building on top of data, we introduce Genie Operator-1 (GO-1), a novel
generalist policy that leverages latent action representations to maximize data
utilization, demonstrating predictable performance scaling with increased data
volume. Policies pre-trained on our dataset achieve an average performance
improvement of 30% over those trained on Open X-Embodiment, in both in-domain
and out-of-distribution scenarios. GO-1 exhibits exceptional capability in
real-world dexterous and long-horizon tasks, achieving over 60% success rate on
complex tasks and outperforming the prior RDT approach by 32%. By open-sourcing
the
dataset, tools, and models, we aim to democratize access to large-scale,
high-quality robot data, advancing the pursuit of scalable and general-purpose
intelligence.
comment: Project website: https://agibot-world.com/. Github repo:
https://github.com/OpenDriveLab/AgiBot-World. The author list is ordered
alphabetically by surname, with detailed contributions provided in the
appendix
♻ ☆ Confidence-Controlled Exploration: Efficient Sparse-Reward Policy Learning for Robot Navigation
Bhrij Patel, Kasun Weerakoon, Wesley A. Suttle, Alec Koppel, Brian M. Sadler, Tianyi Zhou, Amrit Singh Bedi, Dinesh Manocha
Reinforcement learning (RL) is a promising approach for robotic navigation,
allowing robots to learn through trial and error. However, real-world robotic
tasks often suffer from sparse rewards, leading to inefficient exploration and
suboptimal policies due to the sample inefficiency of RL. In this work, we
introduce Confidence-Controlled Exploration (CCE), a novel method that improves
sample efficiency in RL-based robotic navigation without modifying the reward
function. Unlike existing approaches, such as entropy regularization and reward
shaping, which can introduce instability by altering rewards, CCE dynamically
adjusts trajectory length based on policy entropy. Specifically, it shortens
trajectories when uncertainty is high to enhance exploration and extends them
when confidence is high to prioritize exploitation. CCE is a principled and
practical solution inspired by a theoretical connection between policy entropy
and gradient estimation. It integrates seamlessly with on-policy and off-policy
RL methods and requires minimal modifications. We validate CCE across
REINFORCE, PPO, and SAC in both simulated and real-world navigation tasks. CCE
outperforms fixed-trajectory and entropy-regularized baselines, achieving an
18% higher success rate, 20-38% shorter paths, and 9.32% lower elevation
costs under a fixed training sample budget. Finally, we deploy CCE on a
Clearpath Husky robot, demonstrating its effectiveness in complex outdoor
environments.
comment: 10 pages, 6 figures, 2 tables
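The core mechanism, shortening rollouts when policy entropy is high and lengthening them when it is low, can be sketched in a few lines. The bounds and linear normalization below are assumptions, not the paper's exact schedule.

```python
import numpy as np

def cce_trajectory_length(policy_entropy, h_lo, h_hi, t_min=32, t_max=512):
    """Map policy entropy to a rollout length: high entropy (low
    confidence) -> short trajectories to enhance exploration; low entropy
    (high confidence) -> long trajectories to prioritize exploitation."""
    frac = np.clip((policy_entropy - h_lo) / (h_hi - h_lo), 0.0, 1.0)
    return int(round(t_max - frac * (t_max - t_min)))
```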
♻ ☆ Versatile Demonstration Interface: Toward More Flexible Robot Demonstration Collection
Previous methods for Learning from Demonstration leverage several approaches
for a human to teach motions to a robot, including teleoperation, kinesthetic
teaching, and natural demonstrations. However, little previous work has
explored more general interfaces that allow for multiple demonstration types.
Given the varied preferences of human demonstrators and task characteristics, a
flexible tool that enables multiple demonstration types could be crucial for
broader robot skill training. In this work, we propose Versatile Demonstration
Interface (VDI), an attachment for collaborative robots that simplifies the
collection of three common types of demonstrations. Designed for flexible
deployment in industrial settings, our tool requires no additional
instrumentation of the environment. Our prototype interface captures human
demonstrations through a combination of vision, force sensing, and state
tracking (e.g., through robot proprioception or AprilTag tracking). Through
a user study where we deployed our prototype VDI at a local manufacturing
innovation center with manufacturing experts, we demonstrated VDI in
representative industrial tasks. Interactions from our study highlight the
practical value of VDI's varied demonstration types, expose a range of
industrial use cases for VDI, and provide insights for future tool design.
comment: 8 pages, 6 figures
♻ ☆ Maintaining Strong $r$-Robustness in Reconfigurable Multi-Robot Networks using Control Barrier Functions ICRA
In leader-follower consensus, strong $r$-robustness of the communication
graph provides a sufficient condition for followers to achieve consensus in the
presence of misbehaving agents. Previous studies have assumed that robots can
form and/or switch between predetermined network topologies with known
robustness properties. However, robots with distance-based communication models
may not be able to achieve these topologies while moving through spatially
constrained environments, such as narrow corridors, to complete their
objectives. This paper introduces a Control Barrier Function (CBF) that ensures
robots maintain strong $r$-robustness of their communication graph above a
certain threshold without maintaining any fixed topologies. Our CBF directly
addresses robustness, allowing robots to have flexible reconfigurable network
structure while navigating to achieve their objectives. The efficacy of our
method is tested through various simulation and hardware experiments.
comment: Accepted and will appear at IEEE International Conference on Robotics
and Automation (ICRA) 2025
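A single-constraint control barrier function filter of the kind described admits a closed-form solution. In this sketch, h stands for the strong r-robustness margin above the threshold; how to compute that margin efficiently is the paper's contribution and is abstracted away here.

```python
import numpy as np

def cbf_safety_filter(u_nom, grad_h, h, alpha=1.0):
    """Minimal CBF-QP sketch: minimally modify the nominal input so that
    dh/dt = grad_h @ u >= -alpha * h, which keeps the margin h >= 0.
    With one affine constraint, the QP reduces to a projection."""
    violation = grad_h @ u_nom + alpha * h
    if violation >= 0.0:              # nominal input already satisfies CBF
        return u_nom
    return u_nom - violation * grad_h / (grad_h @ grad_h)
```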
♻ ☆ Knowledge-data fusion dominated vehicle platoon dynamics modeling and analysis: A physics-encoded deep learning approach
Artificial intelligence (AI)-enabled nonlinear vehicle platoon dynamics
modeling has recently come to play a crucial role in predicting and optimizing
the interactions between vehicles. However, existing efforts fail to extract
and capture vehicle behavior interaction features at the platoon scale. More
importantly, maintaining high modeling accuracy without losing physical
analyzability remains an open problem. To this end, this paper proposes a novel
physics-encoded deep learning network, named PeMTFLN, to model the nonlinear
vehicle platoon dynamics. Specifically, an analyzable parameters encoded
computational graph (APeCG) is designed to guide the platoon to respond to the
driving behavior of the lead vehicle while ensuring local stability. Besides, a
multi-scale trajectory feature learning network (MTFLN) is constructed to
capture platoon following patterns and infer the physical parameters required
for APeCG from trajectory data. The human-driven vehicle trajectory dataset
(HIGHSIM) was used to train the proposed PeMTFLN. Trajectory prediction
experiments show that PeMTFLN is superior to the baseline models in terms of
predictive accuracy in speed and gap. The stability analysis shows that the
physical parameters in APeCG are able to reproduce platoon stability under
real-world conditions. In simulation experiments, PeMTFLN achieves low
inference error in platoon trajectory generation. Moreover, PeMTFLN also
accurately reproduces ground-truth safety statistics. The code of the proposed
PeMTFLN is open source.
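To make "physics-encoded" concrete, the sketch below pairs a learned encoder that infers interpretable, positive gains with a fixed linear car-following law. The specific law and parameterization are illustrative assumptions, not the paper's APeCG/MTFLN design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhysicsEncodedFollower(nn.Module):
    """Hypothetical sketch: a GRU infers physically meaningful feedback
    gains from trajectory history; a fixed, analyzable physics layer maps
    gap error and speed difference to an acceleration command."""

    def __init__(self, hidden=64):
        super().__init__()
        self.encoder = nn.GRU(input_size=3, hidden_size=hidden,
                              batch_first=True)
        self.head = nn.Linear(hidden, 2)   # -> (k_gap, k_speed)

    def forward(self, history, gap_error, speed_diff):
        # history: (B, T, 3) of [gap, relative speed, ego speed]
        _, h = self.encoder(history)
        k = F.softplus(self.head(h[-1]))   # keep inferred gains positive
        k_gap, k_speed = k[:, 0], k[:, 1]
        # analyzable physics layer: linear feedback acceleration
        return k_gap * gap_error + k_speed * speed_diff
```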
♻ ☆ A Generalized Adaptive Jacobian Controller for Soft Robots
The nonlinearity and hysteresis of soft robot motions pose challenges for
control. The Jacobian controller, transferred from rigid-robot control, is
concise, but its underlying assumptions do not hold for soft robots, making it
feasible only in a small local region. Accurate controllers such as neural
networks can handle delayed and nonlinear motion, achieving high accuracy, but
they require large amounts of data and act as black boxes. Inspired by these
approaches, we propose an adaptive generalized Jacobian controller for soft
robots. This controller retains the concise form of the Jacobian controller
but includes more states and independent matrices, making it suitable for soft
robotics. In addition, the initialization
leverages the motor babbling strategy and batch optimization from neural
network controllers. In experiments, we first analyze the online controllers,
including the Jacobian controller, the Gaussian process regression, and our
controller. Real-world experiments validate that our controller outperforms an
RNN controller even with fewer data samples and adapts to various situations
without fine-tuning, such as different control frequencies, softness levels,
and even manufacturing errors. Future work may include online adjustment of the
controller format and validation of its adaptability in more scenarios.
comment: 10 pages, 8 figures, 4 tables
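A standard way to realize an adaptive Jacobian of this flavor is an online Broyden-style rank-one update paired with a resolved-rate step. This is a generic baseline sketch, not the paper's generalized controller with additional states and matrices.

```python
import numpy as np

def broyden_update(J, dq, dx, lam=1.0):
    """Rank-one update so that the Jacobian estimate J better explains the
    observed task-space change dx produced by actuator change dq."""
    return J + np.outer(dx - J @ dq, dq) / (lam + dq @ dq)

def resolved_rate_step(J, x, x_des, gain=0.5):
    """Actuator increment from the task-space error via the pseudo-inverse
    of the current Jacobian estimate."""
    return gain * (np.linalg.pinv(J) @ (x_des - x))
```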
♻ ☆ A new metaheuristic approach for the art gallery problem
In the problem "Localization and trilateration with the minimum number of
landmarks", we faced the 3-Guard and classic Art Gallery Problems. The goal of
the art gallery problem is to find the minimum number of guards within a simple
polygon to observe and protect its entirety. It has many applications in
robotics, telecommunications, etc. Several approaches handle the art gallery
problem, which is theoretically NP-hard. This paper offers an efficient method
based on the Particle Filter algorithm that solves the most fundamental form of
the problem in a nearly optimal manner. Experimental results on the random
polygons generated by Bottino et al. (2011) show that the new method is more
accurate while using fewer or equally many guards. Furthermore, we discuss
resampling and particle numbers to minimize the run time.
comment: This article has undergone many changes and should be reviewed and
rewritten in a different format
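A particle-filter treatment of guard placement can be sketched as follows. Here `sample_point()` (uniform sampling inside the polygon) and `coverage(guards)` (fraction of the polygon visible from a guard set) are assumed helpers, and the resample-and-perturb loop is a generic sketch rather than the paper's exact algorithm.

```python
import numpy as np

def particle_filter_guards(sample_point, coverage, n_guards,
                           n_particles=200, iters=50, noise=0.05):
    """Sketch: particles are candidate guard placements, weighted by the
    fraction of the polygon they cover, then resampled and perturbed."""
    rng = np.random.default_rng(0)
    particles = [np.array([sample_point() for _ in range(n_guards)])
                 for _ in range(n_particles)]
    for _ in range(iters):
        w = np.array([coverage(p) for p in particles])
        if w.max() >= 1.0:                     # full coverage found
            return particles[int(w.argmax())]
        w = (w + 1e-9) / (w + 1e-9).sum()      # normalize weights
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = [particles[i] + rng.normal(0.0, noise, (n_guards, 2))
                     for i in idx]
    w = np.array([coverage(p) for p in particles])
    return particles[int(w.argmax())]
```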
♻ ☆ Unified Feedback Linearization for Nonlinear Systems with Dexterous and Energy-Saving Modes
Systems with a high number of inputs compared to the degrees of freedom (e.g.
a mobile robot with Mecanum wheels) often have a minimal set of
energy-efficient inputs needed to achieve a main task (e.g. position tracking)
and a set of energy-intense inputs needed to achieve an additional auxiliary
task (e.g. orientation tracking). This letter presents a unified control
scheme, derived through feedback linearization, that can switch between two
modes: an energy-saving mode, which tracks the main task using only the
energy-efficient inputs while forcing the energy-intense inputs to zero, and a
dexterous mode, which also uses the energy-intense inputs to track the
auxiliary task as needed. The proposed control guarantees the exponential
tracking of the main task and that the dynamics associated with the main task
evolve independently of the a priori unknown switching signal. When the control
is operating in dexterous mode, the exponential tracking of the auxiliary task
is also guaranteed. Numerical simulations on an omnidirectional Mecanum wheel
robot validate the effectiveness of the proposed approach and demonstrate the
effect of the switching signal on the exponential tracking behavior of the main
and auxiliary tasks.
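The two modes can be illustrated with a small allocation sketch after feedback linearization. The block matrices, their invertibility, and the solve-based allocation are assumptions for illustration, not the letter's actual control law.

```python
import numpy as np

def mode_switched_input(D_main_e, D_main_i, D_aux_e, D_aux_i,
                        v_main, v_aux, dexterous):
    """Sketch of the two modes. D_*_* are blocks of the decoupling matrix
    mapping energy-efficient (e) and energy-intense (i) inputs to the
    main/auxiliary task channels (assumed square and invertible).
    Energy-saving mode: force u_i = 0 and track the main task only."""
    if not dexterous:
        u_e = np.linalg.solve(D_main_e, v_main)   # main task via u_e only
        u_i = np.zeros(D_main_i.shape[1])
        return np.concatenate([u_e, u_i])
    D = np.block([[D_main_e, D_main_i],
                  [D_aux_e,  D_aux_i]])
    return np.linalg.solve(D, np.concatenate([v_main, v_aux]))
```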
♻ ☆ ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models ICRA-2025
Recent progress in large language models and access to large-scale robotic
datasets has sparked a paradigm shift in robotics, transforming models into
generalists able to adapt to various tasks, scenes, and robot modalities. A
large step for the community is open Vision-Language-Action models, which
showcase strong performance in a wide variety of tasks. In this work, we study
the visual generalization capabilities of three existing robotic foundation
models, and propose a corresponding evaluation framework.
Our study shows that the existing models do not exhibit robustness to visual
out-of-domain scenarios. This is potentially caused by limited variations in
the training data and/or catastrophic forgetting, leading to domain limitations
in the vision foundation models. We further explore OpenVLA, which uses two
pre-trained vision foundation models and is, therefore, expected to generalize
to out-of-domain experiments. However, we showcase catastrophic forgetting by
DINO-v2 in OpenVLA through its failure to fulfill the task of depth regression.
To overcome the aforementioned issue of visual catastrophic forgetting, we
propose a gradual backbone reversal approach founded on model merging. This
enables OpenVLA -- which requires the adaptation of the visual backbones
during initial training -- to regain its visual generalization ability.
Regaining this capability enables our ReVLA model to improve over OpenVLA by
77% and 66% for grasping and lifting in visual OOD tasks.
comment: Accepted at ICRA-2025, Atlanta
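The backbone-reversal idea rests on weight-space model merging, which can be sketched as a simple interpolation between state dicts. The gradual schedule and any per-layer choices are assumptions; this is not ReVLA's exact procedure.

```python
import torch

@torch.no_grad()
def merge_toward_pretrained(finetuned_sd, pretrained_sd, alpha):
    """Weight-space merging sketch: interpolate the visual backbone back
    toward the original pretrained encoder. alpha=0 keeps the finetuned
    weights; alpha=1 fully reverts. A gradual reversal raises alpha in
    steps while monitoring task performance."""
    return {k: (1.0 - alpha) * v + alpha * pretrained_sd[k]
            for k, v in finetuned_sd.items()}
```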
♻ ☆ Efficient End-to-End 6-DoF Grasp Detection Framework for Edge Devices with Hierarchical Heatmaps and Feature Propagation
6-DoF grasp detection is critically important for the advancement of
intelligent embodied systems, as it provides feasible robot poses for object
grasping. Various methods have been proposed to detect 6-DoF grasps through the
extraction of 3D geometric features from RGBD or point cloud data. However,
most of these approaches encounter challenges during real robot deployment due
to their significant computational demands, which can be particularly
problematic for mobile robot platforms, especially those reliant on edge
computing devices. This paper presents an Efficient End-to-End Grasp Detection
Network (E3GNet) for 6-DoF grasp detection utilizing hierarchical heatmap
representations. E3GNet effectively identifies high-quality and diverse grasps
in cluttered real-world environments. Benefiting from our end-to-end
methodology and efficient network design, our approach surpasses previous
methods in model inference efficiency and achieves real-time 6-DoF grasp
detection on edge
devices. Furthermore, real-world experiments validate the effectiveness of our
method, achieving a satisfactory 94% object grasping success rate.
comment: Accepted by 2025 IEEE International Symposium on Circuits and Systems
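The coarse stage of a hierarchical-heatmap pipeline can be illustrated by selecting the top-scoring pixels of a grasp-location heatmap; finer heads would then regress the remaining grasp parameters at those locations. The tensor shapes and k are assumptions, not E3GNet's actual design.

```python
import torch

def topk_grasp_locations(heatmap, k=16):
    """Pick the k highest-scoring pixels of a (B, H, W) grasp-location
    heatmap as candidate regions for finer-grained prediction."""
    B, H, W = heatmap.shape
    scores, flat_idx = heatmap.view(B, -1).topk(k, dim=1)
    ys = torch.div(flat_idx, W, rounding_mode="floor")
    xs = flat_idx % W
    return scores, torch.stack([xs, ys], dim=-1)   # (B, k), (B, k, 2)
```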
♻ ☆ Long-horizon Embodied Planning with Implicit Logical Inference and Hallucination Mitigation
Long-horizon embodied planning underpins embodied AI. To accomplish
long-horizon tasks, one of the most feasible ways is to decompose abstract
instructions into a sequence of actionable steps. Foundation models still face
logical errors and hallucinations in long-horizon planning, unless provided
with examples highly relevant to the task. However, providing such examples
for any arbitrary task is impractical. Therefore, we present ReLEP, a
novel framework for Real-time Long-horizon Embodied Planning. ReLEP can
complete a wide range of long-horizon tasks without in-context examples by
learning implicit logical inference through fine-tuning. The fine-tuned large
vision-language model formulates plans as sequences of skill functions. These
functions are selected from a carefully designed skill library. ReLEP is also
equipped with a Memory module for plan and status recall, and a Robot
Configuration module for versatility across robot types. In addition, we
propose a data generation pipeline to tackle dataset scarcity. When
constructing the dataset, we encoded implicit logical relationships, enabling
the model to learn them and dispel hallucinations. Through comprehensive
evaluations across various long-horizon
tasks, ReLEP demonstrates high success rates and compliance with execution even
on unseen tasks and outperforms state-of-the-art baseline methods.
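Plans formulated as sequences of skill functions from a library can be illustrated with a small registry and dispatcher. The skill names and the (name, kwargs) plan format are hypothetical, not ReLEP's actual library.

```python
SKILL_LIBRARY = {}

def skill(fn):
    """Register a function in the skill library by name."""
    SKILL_LIBRARY[fn.__name__] = fn
    return fn

@skill
def navigate_to(target: str) -> None:
    print(f"navigating to {target}")

@skill
def pick_up(obj: str) -> None:
    print(f"picking up {obj}")

def execute_plan(plan):
    """Run a plan given as (skill_name, kwargs) pairs, the kind of output
    a fine-tuned vision-language model could be prompted to produce."""
    for name, kwargs in plan:
        SKILL_LIBRARY[name](**kwargs)

execute_plan([("navigate_to", {"target": "kitchen"}),
              ("pick_up", {"obj": "cup"})])
```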
♻ ☆ ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark
Ronghao Dang, Yuqian Yuan, Wenqi Zhang, Yifei Xin, Boqiang Zhang, Long Li, Liuyi Wang, Qinyang Zeng, Xin Li, Lidong Bing
The enhancement of generalization in robots by large vision-language models
(LVLMs) is increasingly evident. Therefore, the embodied cognitive abilities of
LVLMs based on egocentric videos are of great interest. However, current
datasets for embodied video question answering lack comprehensive and
systematic evaluation frameworks. Critical embodied cognitive issues, such as
robotic self-cognition, dynamic scene perception, and hallucination, are rarely
addressed. To tackle these challenges, we propose ECBench, a high-quality
benchmark designed to systematically evaluate the embodied cognitive abilities
of LVLMs. ECBench features a diverse range of scene video sources, open and
varied question formats, and 30 dimensions of embodied cognition. To ensure
quality, balance, and high visual dependence, ECBench uses class-independent
meticulous human annotation and multi-round question screening strategies.
Additionally, we introduce ECEval, a comprehensive evaluation system that
ensures the fairness and rationality of the indicators. Utilizing ECBench, we
conduct extensive evaluations of proprietary, open-source, and task-specific
LVLMs. ECBench is pivotal in advancing the embodied cognitive capabilities of
LVLMs, laying a solid foundation for developing reliable core models for
embodied agents. All data and code are available at
https://github.com/Rh-Dang/ECBench.
♻ ☆ LaMMA-P: Generalizable Multi-Agent Long-Horizon Task Allocation and Planning with LM-Driven PDDL Planner ICRA 2025
Language models (LMs) possess a strong capability to comprehend natural
language, making them effective in translating human instructions into detailed
plans for simple robot tasks. Nevertheless, it remains a significant challenge
to handle long-horizon tasks, especially in subtask identification and
allocation for cooperative heterogeneous robot teams. To address this issue, we
propose a Language Model-Driven Multi-Agent PDDL Planner (LaMMA-P), a novel
multi-agent task planning framework that achieves state-of-the-art performance
on long-horizon tasks. LaMMA-P integrates the strengths of the LMs' reasoning
capability and the traditional heuristic search planner to achieve a high
success rate and efficiency while demonstrating strong generalization across
tasks. Additionally, we create MAT-THOR, a comprehensive benchmark that
features household tasks with two different levels of complexity based on the
AI2-THOR environment. The experimental results demonstrate that LaMMA-P
achieves a 105% higher success rate and 36% higher efficiency than existing
LM-based multi-agent planners. The experimental videos, code, datasets, and
detailed prompts used in each module can be found on the project website:
https://lamma-p.github.io.
comment: IEEE Conference on Robotics and Automation (ICRA 2025); Project
website: https://lamma-p.github.io/
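The LM-plus-classical-planner pipeline can be sketched as a skeleton with assumed callables; `query_lm` (returning parsed structured output) and `classical_planner` are placeholders, and none of this reflects LaMMA-P's actual prompts or modules.

```python
def lm_pddl_pipeline(instruction, robots, query_lm, classical_planner):
    """Hypothetical skeleton: the LM decomposes the instruction into
    per-robot subtasks and drafts one PDDL problem per robot; a
    heuristic-search planner then produces executable plans."""
    subtasks = query_lm(
        f"Decompose '{instruction}' into one subtask per robot in {robots}; "
        "return a JSON object mapping robot name to subtask.")
    plans = {}
    for robot, subtask in subtasks.items():
        problem_pddl = query_lm(
            f"Write a PDDL problem for robot '{robot}' to achieve: {subtask}")
        plans[robot] = classical_planner(problem_pddl)
    return plans
```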
♻ ☆ A Diver Attention Estimation Framework for Effective Underwater Human-Robot Interaction
Many underwater tasks, such as cable-and-wreckage inspection and
search-and-rescue, can benefit from robust Human-Robot Interaction (HRI)
capabilities. With the recent advancements in vision-based underwater HRI
methods, Autonomous Underwater Vehicles (AUVs) have the capability to interact
with their human partners without requiring assistance from a topside operator.
However, in these methods, the AUV assumes that the diver is ready for
interaction, while in reality, the diver may be distracted. In this paper, we
attempt to address this problem by presenting a diver attention estimation
framework for AUVs to autonomously determine the attentiveness of a diver, and
developing a robot controller to allow the AUV to navigate and reorient itself
with respect to the diver before initiating interaction. The core element of
the framework is a deep convolutional neural network called DATT-Net. It is
based on a pyramid structure that can exploit the geometric relations among 10
facial keypoints of a diver to estimate their head orientation, which we use as
an indicator of attentiveness. Our on-the-bench experimental evaluations and
real-world experiments during both closed- and open-water robot trials confirm
the efficacy of the proposed framework.
comment: 9 pages, 6 figures, 2 tables
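Head orientation from facial keypoints is commonly estimated via PnP against a generic 3D face model; the baseline below illustrates that idea (it is not DATT-Net, and the model's +z-out-of-face convention and threshold are assumptions).

```python
import numpy as np
import cv2

def facing_camera(kps_2d, face_model_3d, K, max_angle_deg=30.0):
    """Baseline sketch: recover head pose from facial keypoints (N, 2)
    and a generic 3D face model (N, 3) via PnP, then call the diver
    attentive if the face normal points roughly at the camera."""
    ok, rvec, _ = cv2.solvePnP(face_model_3d, kps_2d, K, None,
                               flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        return False
    R, _ = cv2.Rodrigues(rvec)
    face_normal = R @ np.array([0.0, 0.0, 1.0])   # assumed +z out of face
    # camera looks along +z; a face turned toward the camera points along -z
    angle = np.degrees(np.arccos(np.clip(-face_normal[2], -1.0, 1.0)))
    return angle < max_angle_deg
```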
♻ ☆ Sensor-Invariant Tactile Representation ICLR'25
High-resolution tactile sensors have become critical for embodied perception
and robotic manipulation. However, a key challenge in the field is the lack of
transferability between sensors due to design and manufacturing variations,
which result in significant differences in tactile signals. This limitation
hinders the ability to transfer models or knowledge learned from one sensor to
another. To address this, we introduce a novel method for extracting
Sensor-Invariant Tactile Representations (SITR), enabling zero-shot transfer
across optical tactile sensors. Our approach utilizes a transformer-based
architecture trained on a diverse dataset of simulated sensor designs, allowing
it to generalize to new sensors in the real world with minimal calibration.
Experimental results demonstrate the method's effectiveness across various
tactile sensing applications, facilitating data and model transferability for
future advancements in the field.
comment: Accepted to ICLR'25. Project webpage: https://hgupt3.github.io/sitr/
♻ ☆ Low Fidelity Visuo-Tactile Pretraining Improves Vision-Only Manipulation Performance
Tactile perception is essential for real-world manipulation tasks, yet the
high cost and fragility of tactile sensors can limit their practicality. In
this work, we explore BeadSight (a low-cost, open-source tactile sensor)
alongside a tactile pre-training approach, an alternative method to precise,
pre-calibrated sensors. By pre-training with the tactile sensor and then
disabling it during downstream tasks, we aim to enhance robustness and reduce
costs in manipulation systems. We investigate whether tactile pre-training,
even with a low-fidelity sensor like BeadSight, can improve the performance of
an imitation learning agent on complex manipulation tasks. Through
visuo-tactile pre-training on both similar and dissimilar tasks, we analyze its
impact on a longer-horizon downstream task. Our experiments show that
visuo-tactile pre-training improved performance on a USB cable plugging task by
up to 65% with vision-only inference. Additionally, on a longer-horizon drawer
pick-and-place task, pre-training--whether on a similar, dissimilar, or
identical task--consistently improved performance, highlighting the potential
for a large-scale visuo-tactile pre-trained encoder.
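One simple way to realize "pre-train with tactile, disable it downstream" is to keep the policy's input layout fixed and zero-fill the tactile slot at deployment; this is an assumed mechanism for illustration, not necessarily the paper's exact masking scheme.

```python
import torch

def act_vision_only(vision_encoder, policy, images, tactile_dim):
    """Deployment-time sketch: the policy was pre-trained on concatenated
    visuo-tactile features, but the tactile sensor is disabled for the
    downstream task, so its feature slot is filled with zeros."""
    with torch.no_grad():
        v = vision_encoder(images)                       # (B, D_v)
        t = torch.zeros(v.shape[0], tactile_dim,
                        device=v.device, dtype=v.dtype)  # disabled sensor
        return policy(torch.cat([v, t], dim=-1))
```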